Skip to content

feat(db) resource version cas#1292

Open
derekwaynecarr wants to merge 1 commit into
NVIDIA:mainfrom
derekwaynecarr:feat/db-resource-version-cas
Open

feat(db) resource version cas#1292
derekwaynecarr wants to merge 1 commit into
NVIDIA:mainfrom
derekwaynecarr:feat/db-resource-version-cas

Conversation

@derekwaynecarr
Copy link
Copy Markdown
Collaborator

@derekwaynecarr derekwaynecarr commented May 9, 2026

Summary

Add Compare-And-Swap (CAS) infrastructure for safe concurrent object mutations
and migrate critical paths to use it. This prevents lost updates in HA
deployments with multiple gateway replicas.

Core infrastructure:

  • Add resource_version field to ObjectMeta proto (uint64)
  • Add resource_version column to objects table (SQLite: INTEGER, Postgres: BIGINT)
  • Add WriteCondition enum (MustCreate, MatchResourceVersion, Unconditional)
  • Add PersistenceError::Conflict variant for version mismatch
  • Add Store::put_if() and Store::delete_if() CAS methods
  • Add Store::update_message_cas() with bounded retry for mutations
  • Implement CAS operations for both SQLite and Postgres backends
  • Hydrate resource_version on all typed reads (defaults to 1 for backfill)

Migrations:

  • Migrate policy mutations to CAS (draft operations, settings)
  • Migrate provider updates to CAS (credentials, config merging)
  • Migrate sandbox updates to CAS (phase transitions, status reconciliation)
  • Migrate compute status updates to CAS (driver watch event handling)

Database migrations backfill existing rows with resource_version = 1.
CAS updates increment atomically: resource_version = resource_version + 1.

gRPC handlers map PersistenceError::Conflict to ABORTED status code
to signal clients to retry with fresh data. Server-side retries use
bounded retry (5 attempts) with fresh reads on each iteration.

Test coverage includes concurrent update scenarios and handler-level
resource_version round-trip tests.

Related Issue

Fixes #1255

Changes

Testing

  • [x ] mise run pre-commit passes
  • [ x] Unit tests added/updated
  • [ x] E2E tests added/updated (if applicable)

Checklist

  • [ x] Follows Conventional Commits
  • [ x] Commits are signed off (DCO)
  • [ x] Architecture docs updated (if applicable)

@derekwaynecarr derekwaynecarr requested review from a team, maxamillion and mrunalp as code owners May 9, 2026 13:59
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@derekwaynecarr derekwaynecarr changed the title Feat/db resource version cas feat(db) resource version cas May 9, 2026
@derekwaynecarr derekwaynecarr force-pushed the feat/db-resource-version-cas branch from 30e8540 to 46c93c3 Compare May 10, 2026 01:38
Comment thread crates/openshell-server/src/grpc/provider.rs
Comment thread crates/openshell-server/src/grpc/provider.rs Outdated
Comment thread crates/openshell-server/src/grpc/policy.rs Outdated
Adds Compare-And-Swap (CAS) based optimistic concurrency control to
prevent lost updates in concurrent modification scenarios. This
implements client-driven CAS for update operations and proper atomic
create protection.

Database Changes:
- Add resource_version field to ObjectMeta proto message
- Add resource_version column to objects table (migration 005)
- Implement update_message_cas for single-attempt versioned updates
- Add WriteCondition::MatchResourceVersion for CAS enforcement
- Add PersistenceError::Conflict for version mismatch detection

Protected Operations:
- AttachSandboxProvider: uses client-driven CAS with
  expected_resource_version parameter
- DetachSandboxProvider: uses client-driven CAS with
  expected_resource_version parameter
- UpdateProvider: extracts resource_version from Provider.metadata,
  validates against current version
- UpdateConfig: uses client-driven CAS for policy backfill path
- CreateProvider: uses WriteCondition::MustCreate for atomic creation
- SSH session operations: proper CAS protection

CAS Modes:
- Client-driven (expected_version > 0): Client fetches resource,
  uses its version for update. Conflict returns ABORTED status.
- Server-driven (expected_version = 0): Server uses current DB
  version. Used for internal operations.

CLI Changes:
- Update attach/detach operations to fetch sandbox first and use
  its resource_version for CAS protection
- Add clear error messages for ABORTED status on CAS conflicts
- Add expected_resource_version: 0 to all UpdateConfig requests

Testing:
- 12 integration tests for concurrent modification scenarios
- Tests verify ABORTED status on version conflicts
- Coverage for all protected operations

Documentation:
- Update architecture/gateway.md with CAS design and semantics
- Document expected_version parameter modes
- List all client-driven CAS operations

Signed-off-by: Derek Carr <decarr@redhat.com>
@derekwaynecarr derekwaynecarr force-pushed the feat/db-resource-version-cas branch from 46c93c3 to 4931e83 Compare May 14, 2026 05:20
@derekwaynecarr
Copy link
Copy Markdown
Collaborator Author

@johntmyers ptal. updated per prior feedback, also updated all proto paths that required CAS enablement. the recently merged draft chunk work is not covered right now, i would like to leave that as a follow-on.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List responses still decode the stored protobuf payload directly, but the row resource_version is authoritative and the payload often contains 0 or the previous version. That means clients can receive stale/zero versions from list APIs and then get avoidable CAS failures or skip client-driven CAS with an unusable version.

Please hydrate metadata.resource_version = record.resource_version after decoding, or add a typed list_messages<T>() helper that mirrors get_message*. I see this pattern in ListProviders, ListSandboxes, and ListServices.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This startup-resume error path still bypasses the new CAS/resource-version discipline. resume_persisted_sandboxes decodes sandboxes from list payloads, then mark_sandbox_error writes the mutated sandbox back with put_message, which does not use the DB row version and does not advance resource_version. A real phase/status change can be persisted without clients detecting it via CAS.

Please update this path with update_message_cas or put_if(..., MatchResourceVersion(record.resource_version)) so the Error transition increments the row version like the other sandbox mutations.

sandbox.object_name()
)));
// Generate UUID for database row and update metadata.id to match
let sandbox_id = uuid::Uuid::new_v4().to_string();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handle_create_sandbox already generates and validates a sandbox ID before calling into ComputeRuntime, but this method replaces it with a second UUID. That means validation, logging, and any caller-side correlation around the first ID refer to an object ID that is never persisted or sent to the driver.

Please keep the server-constructed ID from the incoming Sandbox and use MustCreate with that ID, rather than regenerating it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(gateway): add DB-backed resource_version CAS for stored objects

2 participants